fuzzywuzzy

Fuzzy Wuzzy was a bear…

  • fuzzy wuzzy had no hair
  • fuzzy wuzzy wasn’t really fuzzy
  • was he?

Data cleaning challenge

  • We cannot prevent every inconsistency among the people who enter data

  • Inconsistent data entry isn’t always predictable

Challenge: two databases


  • Database 1 (the source of truth):

    School Name | Dept. of Ed. ID Number | City | State

  • Database 2 (yearly reports submitted by banks, covering over 1,300 schools):

    School Name | Dept. of Ed. ID Number | City | State


…and lots of other information

Common Errors Included…

  • East Charter HS vs. East Charter High School
  • Overton Academy vs. Overton
  • Jackson High School vs. Robert Jackson High School
  • and many more combinations and possible errors

Solution: fuzzywuzzy Python package!

  • pip install fuzzywuzzy
  • Quantifies the similarity between strings

My approach:

  • For entries without a clear match, I narrowed the candidates to schools in the same city and state
  • Accepted matches with a fuzz ratio greater than 90 (fuzzywuzzy scores run from 0 to 100)
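
The approach above could be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the record layout, the helper names (simple_ratio, best_match), and the sample schools and IDs are all hypothetical. To keep the sketch self-contained, the scorer is a standard-library stand-in built on difflib; it works on fuzzywuzzy's 0 to 100 scale, with a threshold of 90.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale (stdlib stand-in for a fuzz-style scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(record, truth, scorer=simple_ratio, threshold=90):
    """Return the best source-of-truth row in the same city and state.

    Only rows scoring strictly above `threshold` are accepted;
    returns None when no candidate qualifies.
    """
    candidates = [
        row for row in truth
        if row["city"] == record["city"] and row["state"] == record["state"]
    ]
    best_row, best_score = None, threshold
    for row in candidates:
        score = scorer(record["name"], row["name"])
        if score > best_score:
            best_row, best_score = row, score
    return best_row


# Hypothetical sample data, for illustration only
truth = [
    {"name": "Overton Academy", "id": "001", "city": "Springfield", "state": "IL"},
    {"name": "Overton Academy", "id": "002", "city": "Dayton", "state": "OH"},
    {"name": "East Charter High School", "id": "003", "city": "Springfield", "state": "IL"},
]
report_row = {"name": "Overton Acadamy", "city": "Springfield", "state": "IL"}

match = best_match(report_row, truth)
print(match["id"] if match else "no match")
```

With fuzzywuzzy installed, passing scorer=fuzz.ratio to best_match should swap in the package's own scorer without other changes.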

fuzzywuzzy Options

  1. Ratio (Simple Ratio): compares the two strings as a whole
     fuzz.ratio("Humpty Dumpty sat on a wall", "Humpty Dumpty Sat on a Wall!")  # 91
  2. Partial Ratio: scores the best-matching substring
     fuzz.partial_ratio("Humpty Dumpty sat on a wall", "Humpty")  # 100
  3. Token Set Ratio: ignores word order and repeated words
     fuzz.token_set_ratio("Humpty Dumpty sat on a wall", "Humpty Humpty Dumpty sat on a wall")  # 100
  4. Token Sort Ratio: ignores word order
     fuzz.token_sort_ratio("Humpty Dumpty sat on a wall", "Dumpty Humpty wall on sat a")  # 100

Source: Jash Data Sciences
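
To make the difference between the options more concrete, here is a rough standard-library sketch of the idea behind the token sort ratio: normalize both strings, sort their words, then apply the simple ratio. The function names are hypothetical, and this approximates the technique rather than reproducing the package's implementation.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def normalize(s: str) -> str:
    """Replace non-word characters with spaces, lowercase, and trim."""
    return re.sub(r"\W", " ", s).lower().strip()


def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Sort the words of each string before comparing, so order is ignored."""
    sorted_a = " ".join(sorted(normalize(a).split()))
    sorted_b = " ".join(sorted(normalize(b).split()))
    return simple_ratio(sorted_a, sorted_b)


original = "Humpty Dumpty sat on a wall"
scrambled = "Dumpty Humpty wall on sat a"

# The simple ratio is penalized by the changed word order ...
print(simple_ratio(original, scrambled))
# ... while the sorted-token comparison treats the strings as identical.
print(token_sort_ratio_sketch(original, scrambled))  # 100
```

The same pattern explains why the token-based options are forgiving of reordered school names, while the simple ratio is not.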

Sample Code

# Notebook shell command; in a terminal, run: pip install fuzzywuzzy
!pip install fuzzywuzzy
from fuzzywuzzy import fuzz

# Two strings to compare
str1 = "Humpty"
str2 = "Humpty!"

# Calculate fuzz ratio
simple_ratio = fuzz.ratio(str1, str2)
print(f"The fuzz ratio is {simple_ratio}")
partial_ratio = fuzz.partial_ratio(str1, str2)
print(f"The fuzz partial ratio is {partial_ratio}")
token_set_ratio = fuzz.token_set_ratio(str1, str2)
print(f"The fuzz token set ratio is {token_set_ratio}")
token_sort_ratio = fuzz.token_sort_ratio(str1, str2)
print(f"The fuzz token sort ratio is {token_sort_ratio}")

Output:

Requirement already satisfied: fuzzywuzzy in /opt/anaconda3/lib/python3.9/site-packages (0.18.0)
The fuzz ratio is 92
The fuzz partial ratio is 100
The fuzz token set ratio is 100
The fuzz token sort ratio is 100

Math Foundation

The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into another.


Examples:

String1   String2      Levenshtein Distance   Fuzz Ratio
CAT       PAT          1                      67
DOG       FOG          1                      67
APPLE     APPEAL       3                      73
PYTHON    JAVASCRIPT   10                     25
CAR       CARROT       3                      67
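
Distances like these can be computed with a short dynamic-programming routine. The sketch below is a textbook implementation written for illustration (the function name levenshtein is an assumption, not part of fuzzywuzzy); each insertion, deletion, or substitution costs 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,                      # delete char_a
                curr[j - 1] + 1,                  # insert char_b
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(levenshtein("CAT", "PAT"))     # 1 (substitute C -> P)
print(levenshtein("CAR", "CARROT"))  # 3 (insert R, O, T)
```

Only two rows of the table are held in memory at a time, so the routine uses space proportional to the length of the second string.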

Comparison of Fuzz Ratios

References / Helpful Resources

Access presentation: